Data loading and preparation

I will start by importing the required libraries and loading the dataset into a pandas DataFrame so that it can be prepared for analysis.
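The loading step might look like the sketch below. The file name is not given in the text, so the example reads a few made-up rows from an in-memory string; with the real file, a call such as `pd.read_csv("matches.csv")` (hypothetical path) would take its place. Column names other than HomeTeam, AwayTeam, FTR, HS, AS, HF and AF are assumptions based on a common football results layout.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real dataset file (assumption).
SAMPLE_CSV = """Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HF,AF
01/01/12,Team A,Team B,0,1,A,12,9,4,3,15,14
01/01/12,Team C,Team D,1,2,A,10,11,3,5,13,17
02/01/12,Team E,Team F,4,3,H,18,8,9,4,12,16
"""

# read_csv accepts any file-like object, so StringIO mimics a file on disk.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.head())
```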

Next, I will check the size and shape of the dataset to get an idea of the number of rows and columns.

This will print the number of rows and columns in the dataset.

I will then check for missing values in the dataset.

This will print the total number of missing values in the dataset. In case there are missing values, we will need to handle them appropriately.

Finally, I will check the data types of the columns.

This will print the data types of all the columns in the dataset. We can use this information to convert columns to the appropriate data types for analysis.
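The three checks described above (shape, missing values, data types) can be sketched as follows. The small DataFrame is invented for illustration and contains one deliberately missing value.

```python
import numpy as np
import pandas as pd

# Toy data (assumption); one HS value is missing on purpose.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "FTHG": [2, 0, 1],
    "HS": [14.0, np.nan, 9.0],
})

# Shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# isna() marks missing cells; summing twice gives the grand total.
total_missing = int(df.isna().sum().sum())
print(total_missing)

# One dtype per column.
print(df.dtypes)
```

If the total is non-zero, `df.isna().sum()` without the second `sum()` shows which columns are affected.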

Exploratory Data Analysis

After preparing the data, I will perform exploratory data analysis (EDA) to gain insights into the dataset.

Summary Statistics

To start with, I will calculate some summary statistics for the dataset using the describe method.

This will print the count, mean, standard deviation, minimum, 25%, 50%, 75%, and maximum values for all numerical columns in the dataset.
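A minimal illustration of `describe` on invented data. FTHG and FTAG (full-time home and away goals) are assumed column names. By default, `describe` skips the non-numeric FTR column.

```python
import pandas as pd

# Toy data (assumption): four matches.
df = pd.DataFrame({
    "FTHG": [2, 0, 1, 3],
    "FTAG": [1, 1, 0, 2],
    "FTR": ["H", "A", "H", "H"],
})

# Summary statistics for the numeric columns only.
summary = df.describe()
print(summary)
```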

Distribution of Target Variable

The target variable in this dataset is the full-time result (FTR), which indicates whether the home team won, lost or drew the match. I will plot a histogram of the target variable to check its distribution.

This will display a histogram of the target variable showing the count of each category (home win, away win, and draw).

Home and Away Team Analysis

I will now analyze the performance of home and away teams in the dataset. I will start by calculating the total number of home wins, away wins, and draws in the dataset.

This will print the total number of home wins, away wins, and draws in the dataset.

Next, I will calculate the average number of goals scored by home and away teams.

This will print the average number of goals scored by home and away teams.
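The outcome counts and goal averages can be sketched as below, again on invented rows. The FTR codes H, D and A (home win, draw, away win) and the FTHG/FTAG column names are assumptions about the dataset's layout.

```python
import pandas as pd

# Toy data (assumption): results 2-1, 0-2, 1-1, 3-0.
df = pd.DataFrame({
    "FTR": ["H", "A", "D", "H"],
    "FTHG": [2, 0, 1, 3],
    "FTAG": [1, 2, 1, 0],
})

# Number of matches per outcome; the same counts feed a bar chart of FTR.
result_counts = df["FTR"].value_counts()
print(result_counts)

# Average goals per match for the home and away sides.
avg_home_goals = df["FTHG"].mean()
avg_away_goals = df["FTAG"].mean()
print(avg_home_goals, avg_away_goals)
```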

Correlation Analysis

I will now perform correlation analysis to identify the relationship between different variables in the dataset. I will start by calculating the correlation matrix for all numerical variables.

This will display a heatmap showing the correlation between all numerical variables in the dataset.
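The correlation matrix behind such a heatmap might be computed as in this sketch. The data is invented and HST (home shots on target) is an assumed column name. The resulting matrix can be passed to whichever plotting library is used for the heatmap.

```python
import pandas as pd

# Toy data (assumption); HST is exactly half of HS, so they correlate perfectly.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C", "Team D"],
    "HS": [10, 12, 14, 16],
    "HST": [5, 6, 7, 8],
    "FTHG": [0, 1, 1, 2],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr)
```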

Creating new variables

We will create several new variables that could potentially provide valuable insights into team performance.

Possession ratio

Possession ratio is the percentage of time a team has possession of the ball during a match. It's an important measure of a team's attacking prowess and ability to control the game. We will create a new variable called possession_ratio that represents the possession ratio for each match.

The dataset does not record possession time directly, so we approximate it with each team's share of the shots taken in a match. We add the HS (home team shots) and AS (away team shots) columns to get the total number of shots in the match, and then take the home team's share of that total. This is only a rough proxy, since a team can dominate possession while taking few shots. The ratio is calculated as follows:

$$ \text{Possession Ratio} = \frac{\text{Total Shots By Home Team}}{\text{Total Shots By Home Team + Total Shots By Away Team}} $$

We can create a new column Possession Ratio in our dataset to store the calculated values. Here's the code to create this new variable:
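A sketch of this calculation on invented shot counts follows. It implements the formula above directly and rounds to three decimal places.

```python
import pandas as pd

# Toy shot counts (assumption).
df = pd.DataFrame({"HS": [12, 10, 18], "AS": [9, 11, 6]})

# Home team's share of all shots in the match, used as a possession proxy.
total_shots = df["HS"] + df["AS"]
df["Possession Ratio"] = (df["HS"] / total_shots).round(3)

print(df.head(10))
```

A match with no shots at all would produce NaN here, so such rows may need separate handling.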

We have rounded the values to 3 decimal places to make them easier to read. Now, let's take a look at the top 10 rows of the dataset with the newly created Possession Ratio variable.

As we can see, the Possession Ratio variable has been successfully created and added to the dataset. We can now use this variable to analyze the possession statistics of different teams in La Liga.

Goals per game

To calculate the goals per game, we can simply divide the total number of goals by the total number of matches played.

Let's create a new column called GPG which represents the average number of goals per game.
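One possible reading of GPG is sketched below. The text does not say whether it is computed per match or per team, so this example assumes a per-match value (home goals plus away goals) and derives the league-wide average from it. FTHG and FTAG are assumed column names.

```python
import pandas as pd

# Toy data (assumption): four matches.
df = pd.DataFrame({"FTHG": [2, 0, 1, 3], "FTAG": [1, 2, 1, 0]})

# Total goals in each match.
df["GPG"] = df["FTHG"] + df["FTAG"]

# League-wide average: total goals divided by matches played.
league_gpg = df["GPG"].sum() / len(df)
print(df["GPG"].tolist(), league_gpg)
```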

Now, let's plot the distribution of the GPG variable using a histogram.

The resulting plot shows the distribution of goals per game, with most matches producing between 2 and 3 goals.

Shots on target ratio

Shots on target ratio is an important metric to evaluate the effectiveness of a team's offense. We can calculate the shots on target ratio by dividing the number of shots on target by the total number of shots.
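For the home side, the ratio might be computed as in this sketch. HST (home shots on target) is an assumed column name, and the output column name is hypothetical.

```python
import pandas as pd

# Toy data (assumption).
df = pd.DataFrame({"HS": [12, 10, 20], "HST": [6, 5, 5]})

# Share of the home team's shots that were on target.
df["home_sot_ratio"] = df["HST"] / df["HS"]
print(df)
```

The away-side ratio follows the same pattern with the corresponding away columns.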

Now, let's create a box plot to visualize the distribution of shots on target ratio for each team.

As we can see from the plot, there is a significant variation in shots on target ratio among different teams. Let's also create a scatter plot to examine the relationship between shots on target ratio and the number of goals scored by a team.

The scatter plot shows a positive correlation between shots on target ratio and the number of goals scored by a team. The teams with a higher shots on target ratio tend to score more goals.

Passing accuracy

Passing accuracy is an important metric to evaluate the effectiveness of a team's offense. We can calculate the passing accuracy by dividing the number of successful passes by the total number of passes attempted.

Now, let's create a scatter plot to examine the relationship between passing accuracy and the number of goals scored by a team.

The scatter plot shows a weak positive correlation between passing accuracy and the number of goals scored by a team. The teams with a higher passing accuracy tend to score slightly more goals.

Number of Fouls Per Game

Number of fouls per game (FPG) is a simple metric that tells us the average number of fouls committed by each team per game. We can calculate FPG by taking the average of the number of fouls committed by the home team and the away team in each match.

Formula: FPG = (HF + AF) / 2, where HF is the number of fouls committed by the home team and AF is the number of fouls committed by the away team.
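The formula translates directly into a column operation, as this sketch on invented foul counts shows.

```python
import pandas as pd

# Toy foul counts (assumption).
df = pd.DataFrame({"HF": [15, 13, 12], "AF": [14, 17, 16]})

# Average fouls per team in each match: FPG = (HF + AF) / 2.
df["FPG"] = (df["HF"] + df["AF"]) / 2
print(df)
```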

The above code creates a new variable FPG by taking the average of the number of fouls committed by the home team and the away team in each match. We then use Plotly to create a line plot that shows the evolution of fouls per game throughout the season. The plot shows that the number of fouls per game tends to increase towards the end of the season, possibly due to the increased pressure to win crucial matches.

Yellow card ratio

The yellow card ratio tells us the average number of yellow cards shown per team per game. We can calculate the yellow card ratio by taking the total number of yellow cards shown in a season and dividing it by the total number of games played.

Formula: $\text{Yellow card ratio} = \frac{YC}{GP}$, where $YC$ is the total number of yellow cards shown in a season and $GP$ is the total number of games played.
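A per-team version of this formula could be computed as sketched here. HY and AY (yellow cards for the home and away side) are assumed column names, and the data is invented. Each team's cards and games are summed over its home and away matches.

```python
import pandas as pd

# Toy data (assumption): four matches between three teams.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team A", "Team C"],
    "AwayTeam": ["Team B", "Team C", "Team C", "Team A"],
    "HY": [2, 1, 3, 0],
    "AY": [1, 2, 2, 1],
})

# Cards received and games played, separately for home and away matches.
home = df.groupby("HomeTeam")["HY"].agg(["sum", "count"])
away = df.groupby("AwayTeam")["AY"].agg(["sum", "count"])

# Combine both views; fill_value covers teams missing from one side.
totals = home.add(away, fill_value=0)

# YC / GP for every team.
ycr = totals["sum"] / totals["count"]
print(ycr)
```

The same pattern applies to red cards once the corresponding columns are substituted.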

The above code creates a new variable YCR by calculating the yellow card ratio using the formula shown above. We then use Plotly to create a box plot that shows the distribution of yellow card ratios. The plot shows that the median yellow card ratio is around 0.3, which means that a typical team receives about 0.3 yellow cards per game, or roughly one yellow card every three matches.

Red card ratio

The red card ratio tells us the average number of red cards shown per team per game. We can calculate the red card ratio by taking the total number of red cards shown in a season and dividing it by the total number of games played.

Formula: $\text{Red card ratio} = \frac{RC}{GP}$, where $RC$ is the total number of red cards shown in a season and $GP$ is the total number of games played.

The above code creates a new variable RCR by calculating the red card ratio using the formula shown above. We then use Plotly to create a histogram that shows the distribution of red card ratios. The plot shows that the majority of teams receive very few red cards per season, with the median red card ratio being close to zero.

Home advantage

Home advantage is a phenomenon in sports where the home team has a higher chance of winning than the away team. In football, this can be due to factors such as the home team being more familiar with the stadium, having more supporters, or having less travel fatigue.

We can calculate the home advantage for a team by subtracting its win rate in away matches from its win rate in home matches. A positive value indicates that the team wins more often at home, while a negative value indicates that it wins more often away.

Let's create a new variable home_advantage that calculates the home advantage for each team.
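One way to compute this per team is sketched below on invented results. It assumes that FTR uses the codes H, D and A, and that home advantage means a team's home win rate minus its away win rate.

```python
import pandas as pd

# Toy data (assumption): two teams meeting four times.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team A", "Team B", "Team B"],
    "AwayTeam": ["Team B", "Team B", "Team A", "Team A"],
    "FTR": ["H", "H", "A", "D"],
})

# Fraction of home matches won, per home team.
home_win_rate = (df["FTR"] == "H").groupby(df["HomeTeam"]).mean()

# Fraction of away matches won, per away team.
away_win_rate = (df["FTR"] == "A").groupby(df["AwayTeam"]).mean()

# Positive values mean the team wins more often at home than away.
home_advantage = home_win_rate.sub(away_win_rate, fill_value=0)
print(home_advantage)
```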

Now, let's plot the home advantage for each team using a bar chart.

From the plot, we can see that most teams have a positive home advantage, with the exception of Racing Santander and Granada CF. Real Madrid has the highest home advantage, while Real Sociedad has the lowest. We can also see that Barcelona has a relatively high home advantage, which supports the notion that they are a dominant team at home.

Next, let's investigate how home advantage varies across the league by plotting a histogram of home advantage values.

The histogram shows that home advantage is roughly bell-shaped, with a mean of around 0.2. This means that, on average, a team's win rate at home is about 20 percentage points higher than its win rate away. However, there is quite a bit of variation, with some teams having a much higher or lower advantage than average.

Finally, let's plot a scatter plot of home advantage vs. points earned to investigate whether teams with a higher home advantage tend to earn more points.

This code calculates the home advantage and total points earned for each team, and then creates a scatter plot using Plotly Express. The x-axis shows the home advantage, and the y-axis shows the total points earned. Each team is represented by a different color in the plot. The code also adds labels to the x- and y-axes and a title to the plot.

Winning Ratio

Another interesting metric to explore is the winning ratio of each team. We can calculate the winning ratio as the number of wins divided by the total number of games played. Let's calculate this for each team:
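The per-team calculation might look like this sketch, which counts home and away wins separately and divides by all games played. The data is invented and the FTR codes H and A are assumptions.

```python
import pandas as pd

# Toy data (assumption): two teams meeting four times.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team A", "Team B", "Team B"],
    "AwayTeam": ["Team B", "Team B", "Team A", "Team A"],
    "FTR": ["H", "H", "A", "D"],
})

# Wins at home plus wins away, per team.
home_wins = (df["FTR"] == "H").groupby(df["HomeTeam"]).sum()
away_wins = (df["FTR"] == "A").groupby(df["AwayTeam"]).sum()
wins = home_wins.add(away_wins, fill_value=0)

# Games played at home plus games played away, per team.
games = df["HomeTeam"].value_counts().add(
    df["AwayTeam"].value_counts(), fill_value=0
)

# Winning ratio = wins / games played.
win_ratio = wins / games
print(win_ratio)
```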

Now, let's plot the winning ratio for each team using a bar chart:

This generates a bar chart that shows the winning ratio for each team.

We can see that Barcelona has the highest winning ratio, followed by Real Madrid and Valencia.

Next, let's plot the winning ratio as a function of time using a line chart:

The plot shows the winning ratio for each team over time, with time on the x-axis, the winning ratio on the y-axis, and one colored line per team. It lets us compare teams directly: some maintain a consistently high winning ratio, while others fluctuate considerably. Overall, the plot gives a good overview of how the teams performed over the course of the season.

Predicting match outcomes using machine learning

In this section, we will train several machine learning models to predict the outcome of a football match (home team win, away team win, or draw) based on certain features such as the number of goals scored by each team, number of shots on target, number of fouls committed, etc.

Data preprocessing

Before we can train our models, we need to preprocess our dataset. This involves cleaning and transforming the data to make it suitable for use with machine learning algorithms.

First, we will load our dataset using pandas and examine the first few rows:

Next, we will drop unnecessary columns such as the date, halftime score, and number of cards:

We will also convert the categorical variable FTR (full-time result) to a numerical variable, where 0 represents a home team loss or draw, and 1 represents a home team win:
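The binary encoding described above can be sketched as follows. The FTR codes H, D and A are assumed, and the name of the new column is hypothetical.

```python
import pandas as pd

# Toy data (assumption).
df = pd.DataFrame({"FTR": ["H", "A", "D", "H"]})

# 1 for a home win, 0 for a home loss or a draw.
df["FTR_encoded"] = (df["FTR"] == "H").astype(int)
print(df)
```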

Finally, we will split our dataset into training and testing sets with a 70/30 ratio using scikit-learn's train_test_split function, so that each model is trained on one subset of the data and evaluated on data it has not seen. We will use the same split for all of the models so that they can be compared fairly.

Let's start by importing the necessary function from scikit-learn and splitting the data into training and testing sets.
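A minimal sketch of the 70/30 split, using random placeholder features and labels in place of the real data. The `random_state` value is an assumption; fixing it is what keeps the split identical across models.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and binary labels (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(100, 4))
y = rng.integers(0, 2, size=100)

# test_size=0.3 gives the 70/30 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```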

Now that we have split our data, we can proceed to train and evaluate our models.

Decision Trees

This code loads the dataset, adds a new column Favored indicating whether the home team is favored to win or not, and creates a new DataFrame with relevant columns. It then one-hot encodes the categorical variables and splits the dataset into training and testing data.

The Decision Tree model is trained on the training data and used to predict the outcomes of the test data. Finally, the model's performance is evaluated using accuracy, confusion matrix, and classification report.
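The workflow described above might be sketched as follows. The data is randomly generated, so the accuracy is expected to sit near chance level. The feature columns are assumptions, and the Favored column is left out because its exact definition is not shown in the text.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random placeholder matches (assumption), not real results.
rng = np.random.default_rng(0)
n = 200
teams = ["Team A", "Team B", "Team C", "Team D"]
df = pd.DataFrame({
    "HomeTeam": rng.choice(teams, size=n),
    "AwayTeam": rng.choice(teams, size=n),
    "HST": rng.integers(0, 10, size=n),
    "AST": rng.integers(0, 10, size=n),
    "FTR": rng.choice(["H", "D", "A"], size=n),
})

# One-hot encode the team names; the numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns="FTR"), columns=["HomeTeam", "AwayTeam"])
y = df["FTR"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=["H", "D", "A"])
print(accuracy)
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```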

Random Forest

Random forest is an ensemble machine learning algorithm that trains many decision trees on random subsets of the data and features and aggregates their predictions, by voting for classification and by averaging for regression. It is a very powerful algorithm that can be used for both kinds of task.

To use random forest for predicting match outcomes, we can follow a similar process as we did for decision trees. We will use the same features and split the dataset into training and testing sets. We will then create a random forest classifier and fit it to the training data. Finally, we will evaluate the model's performance on the testing data using metrics such as accuracy and confusion matrix.

Here's the code for implementing random forest:
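A minimal sketch of that process is shown here. A synthetic three-class dataset from scikit-learn's `make_classification` stands in for the match data, so the accuracy it prints says nothing about La Liga. The 100 trees match the description in the text, while the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data (assumption) in place of the match features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Forest of 100 trees.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

rf_accuracy = accuracy_score(y_test, forest.predict(X_test))
print(rf_accuracy)
```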

First, we imported the necessary libraries such as pandas for data manipulation and sklearn for machine learning algorithms. We then loaded the La Liga dataset into a pandas dataframe.

Next, we created dummy variables for the categorical features HomeTeam and AwayTeam using the get_dummies() function from pandas. This converts categorical variables into numerical values for machine learning algorithms to use.

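We then created a new feature "Favored" to indicate which team is favored to win the match. This is determined by the full-time result (FTR) column in the dataset. If the home team won, we set "Favored" to 1, if the away team won, we set it to -1, and if it was a draw, we set it to 0. Because this value is derived from the match result itself, it must not be included among the input features, or the model would effectively be shown the answer.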

After creating the new feature, we split the data into training and testing sets using the train_test_split() function from sklearn. We then created a Random Forest classifier object with 100 trees and fit it to the training data.

Finally, we made predictions on the test data using the predict() function from the Random Forest classifier object, and evaluated the accuracy of our model using the accuracy_score() function from sklearn.metrics. We achieved an accuracy of 56.5%, which is not particularly high, but still better than random guessing.

Overall, the Random Forest algorithm is a powerful machine learning tool for predicting outcomes in sports, but it requires careful feature engineering and hyperparameter tuning to achieve high accuracy.

Support Vector Machines

Support Vector Machines (SVMs) are a type of machine learning algorithm used for classification and regression analysis. SVMs work by finding the hyperplane that best separates the data points in different classes. The hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the closest points of each class. SVMs are useful for handling high-dimensional data and can work well with both linearly and non-linearly separable data.

Model Training

Next, we'll train our SVM model using the training data:

Model Evaluation

We can evaluate the performance of our model using the testing data:

This will output the accuracy of the SVM model on the testing data. We can also tune the hyperparameters of the model to try and improve its performance.
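The training and evaluation steps above can be sketched as follows on synthetic data. The kernel and `C` value are scikit-learn defaults chosen here as assumptions, and the scaling step is an addition, since SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data (assumption) in place of the match features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standardize the features, then fit an RBF-kernel SVM.
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm_model.fit(X_train, y_train)

# Accuracy on the held-out test set.
svm_accuracy = accuracy_score(y_test, svm_model.predict(X_test))
print(svm_accuracy)
```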

Naive Bayes

To implement the Gaussian Naive Bayes model, we first import the GaussianNB class from the sklearn.naive_bayes module. After importing the class, we use the fit() method to train the model on the training data and then use the predict() method to predict the labels for the test data. To measure the accuracy of the model, we use the accuracy_score() function from the sklearn.metrics module. Finally, we print the accuracy of the model using the print() function.
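The steps just described might look like this, with synthetic data standing in for the match features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train, predict, and score.
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)
print(nb_accuracy)
```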

Model Comparison and Diagnostics

| Model | Accuracy |
| --- | --- |
| Decision Tree | 0.5263 |
| Random Forest | 0.5175 |
| Support Vector Machines | 0.5526 |
| Naive Bayes | 0.4561 |
| K-Nearest Neighbors | 0.4561 |

Why Were the Model Accuracies Poor?

There can be several reasons for the poor accuracies of the models in predicting match outcomes. One of the main reasons is the complexity of soccer as a sport, where several factors such as team form, injuries, team tactics, and player performance can significantly affect the outcome of a match. Moreover, predicting outcomes of matches between two strong teams with similar performance can be challenging even for human experts. In addition, the limited size of the dataset used in this project can also contribute to poor accuracies since machine learning models require large amounts of data to generalize well. Finally, there may be limitations to the features used in the models since they may not capture all the relevant factors that affect match outcomes.

How Can We Improve the Accuracy of Our Models?

Feature engineering: We can explore creating new features that may be more relevant to predicting match outcomes, such as player statistics, team form in recent matches, or previous head-to-head performance.

Data augmentation: We can try to increase the amount of data available for training the models by using data augmentation techniques, such as generating new samples through data manipulation or collecting data from other sources.

Model tuning: We can experiment with different hyperparameters and architectures for our machine learning models to find the optimal settings for our dataset. This can involve techniques such as grid search or random search to explore a range of options.
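As an illustration of the model tuning idea, a grid search over two random forest hyperparameters could be sketched as below. The parameter grid, the number of folds, and the synthetic data are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

# Hypothetical grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Every candidate is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Random search (`RandomizedSearchCV`) follows the same interface and samples from the grid instead of trying every combination.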

Conclusion

In this project, we explored a dataset of La Liga football matches from the 2011-2012 season, analyzed various variables related to team performance, and used machine learning algorithms to predict match outcomes. Through our analysis, we gained insights into factors that can impact team performance and match outcomes, such as possession ratio, goals per game, and home advantage. Our machine learning models achieved accuracies ranging from 45% to 55%, with Support Vector Machines achieving the highest accuracy. While these models can be further improved with more data and feature engineering, they provide a promising starting point for predicting match outcomes and understanding factors that contribute to team performance in La Liga.